Abstract: In a data stream management system (DSMS),
users register continuous queries, and receive result updates as
data arrive and expire. We focus on applications with real-time
constraints, in which the user must receive each result update
within a given period after the update occurs. To handle fast data,
the DSMS is commonly placed on top of a cloud infrastructure.
Because stream properties such as arrival rates can fluctuate
unpredictably, cloud resources must be dynamically provisioned
and scheduled accordingly to ensure real-time response. It is
therefore essential for existing and future systems to schedule
resources dynamically according to the current workload, in
order to avoid wasting resources or failing to deliver correct
results on time.

Motivated by this, we propose DRS, a novel dynamic resource 
scheduler for cloud-based DSMSs. DRS overcomes three fundamental
challenges: (a) how to model the relationship between
the provisioned resources and query response time; (b) where
to best place resources; and (c) how to measure system load
with minimal overhead. In particular, DRS includes an accurate
performance model based on the theory of Jackson open queueing 
networks and is capable of handling arbitrary operator topologies, 
possibly with loops, splits and joins. Extensive experiments with 
real data confirm that DRS achieves real-time response with close 
to optimal resource consumption. 

I. Introduction 

In many applications, such as analytics over microblogs, 
video feeds and sensor readings, data records are not available
beforehand, but gradually and continuously arrive in the
form of streams. A data stream management system (DSMS) 
handles such streams, and answers long-running, continuous 
queries to users. The results of such a query are delivered in 
the form of a stream of updates. Often, users are interested in 
performing streaming analytics in real time, meaning that each 
result update must reach the user within a given time period 
after the update occurs, i.e., the earliest possible time that it 
can be produced. For instance, consider a DSMS monitoring
surveillance video streams in hospital wards. Events such as
a patient falling should be detected promptly to alert doctors
and nurses in time.

To deal with fast, high-volume streams and stringent real-time
response requirements, it is increasingly common to place
the DSMS on top of a cloud infrastructure, which provides
virtually unlimited computing resources on demand. Because
key properties of a data stream, including its volume, arrival
rates, value distribution, etc., can fluctuate in an unpredictable
manner, the DSMS should ideally provision cloud resources
to each application dynamically, in order to satisfy the real-time
constraints with minimum resource consumption. Meanwhile,
inside an application, resources need to be carefully scheduled
to different components to ensure optimal utilization. Misplacing
resources may cause not only poor resource utilization, but also
instability of the system as a whole.

Figure 1 shows an example video stream processing application
with two operators A (which extracts features from
input video frames) and B (which recognizes objects from the
extracted features), with the output of A fed to B as input. The
record arrival rates for A and B are $\lambda_A$ and $\lambda_B$ respectively,
where $\lambda_A$ depends on the input, e.g., 24 frames per second,
while $\lambda_B$ depends on the output rate of A, i.e., the number of
features extracted in unit time. Inside each operator, an input
is first buffered into an input queue (i.e., $q_A$ in A and $q_B$ in
B) before being processed by one of the parallel processors
($A_1, \ldots, A_n$ in A, $B_1, \ldots, B_m$ in B). Assuming the cloud provides
identical processing units, each processor in A (respectively in
B) can process $\mu_A$ ($\mu_B$) inputs in a unit of time. Clearly, an
operator must have sufficient processors to keep up with its
input rate; otherwise, inputs start to fill its input queue, leading
to increased latency due to waiting, and, eventually, errors
when the queue reaches its size limit. Since the data arrival
rate and processing rate for each processor are uncontrollable,
the main resource scheduling issue is to determine the number
of processors in each operator, in our example, $n$ and $m$.


Fig. 1. Example streaming analytics application.

¹Although there are other types of cloud resources, such as storage and
network bandwidth, we focus on computation-intensive applications where
processors are the key resource.

A simple approach to scheduling resources is to monitor
the workload in each operator, and adjust the number of
processors accordingly. This method is insufficient in multi-operator
applications. For instance, consider the case that at
some point, many recognizable objects appear in the video
stream. Then, although the number of frames per second in
the input (i.e., $\lambda_A$) remains stable, each frame now contains
more extractable features, requiring more work at operator A.
Hence, $\mu_A$ decreases, which consequently overloads operator
A, causing inputs to wait longer in its queue $q_A$, slowing down
query response. Now, if we naively add processors to A to
flush $q_A$, operator A then suddenly produces a large amount of
outputs, leading to a burst in the input rate $\lambda_B$ of operator B,
overloading the latter. This problem is exacerbated when the
application involves a complex network of operators. Figure 2
shows such an example, with splits (A to B, C), joins (C, D
to E) and a feedback loop (E to A). Such topological features
are key enablers for certain applications, e.g., loops allow data
reduction at the input based on the current query results, as
we show with an example in Section V.



Fig. 2. Example complex operator topology. 


As we review in Section II, existing systems largely
overlook the problem of dynamic resource scheduling. Consequently,
to meet the real-time constraint, they either require
manual tuning at runtime (which is infeasible for dynamic
streams), overprovision resources to each operator (which
wastes resources), or resort to load shedding (which leads to
incorrect results). Motivated by this, we design and implement DRS, a
dynamic resource scheduling module. DRS generally applies
to operator-based DSMSs, and allows operators to form an
arbitrary topology, possibly with splits, joins and loops as
shown in Figure 2. In particular, the support for loops can
be a key enabler for certain applications, especially those
involving iterations, as we show with an example in Section V.
Meanwhile, from a semantics point of view, allowing arbitrary
topologies is more general than the two-step MapUpdate in
Muppet [1], and the DAG model in TimeStream [2].

Our main contributions include effective and efficient solutions
to three fundamental problems in dynamic resource
scheduling: (a) how many resources are needed, (b) where
to best place the allocated resources to minimize response
time, and (c) how to implement resource scheduling in a real
system with minimal overhead. In particular, our solutions to
the first two problems are based on the theory of extended
Jackson networks, which provides an educated estimate of
system performance.

The rest of the paper is organized as follows. Section II
surveys related work. Section III presents our performance
model and optimization algorithm. Section IV describes the
implementation of DRS. Section V contains an extensive set
of experiments with real data. Section VI concludes with
directions for future work.


II. Related Work 
A. Resource Scheduling in Cloud Systems 

A cloud consists of a massive number of interconnected 
commodity servers. A key feature of the cloud is that its 
resources, such as CPU cores, memory, disk space and network
bandwidth, can be provisioned to applications on demand. In
fact, most cloud infrastructure providers today offer pay-as-you-go
options for resource usage. Hence, a fundamental requirement
for a system to effectively use the cloud is elasticity,
meaning that the system must be able to dynamically allocate 
and release cloud resources based on the current workload. 
Many traditional parallel and distributed systems, however, 
assume a fixed amount of resources available beforehand, 
rendering them unsuitable to be applied in a cloud platform. 
As a result, many novel elastic cloud-based paradigms and 
systems have emerged in the past decade. 

The first wave of cloud-based systems were built for running
a batch of (often slow) jobs offline. Notably, MapReduce
[3] is a batch processing framework that hides the complexity
of the cloud infrastructure, and exposes a simple programming
interface to users consisting of two functions: map (e.g., for
data filtering and transformation) and reduce (for aggregation
and join). A plethora of MapReduce systems, improvements,
techniques, and optimizations have been proposed in recent
years, and we refer the reader to a comprehensive survey [4].

Resource scheduling has been a central problem in
MapReduce-like systems, and a plethora of schedulers have been
developed and used in production, e.g., Fair Scheduler [5] and
Capacity Scheduler [6]. Since tasks running on nodes without
relevant data incur costly network transmissions, delay
scheduling [7] reduces such non-local tasks by forcing nodes
to wait until either a local task appears, or a specified period 
has passed. These scheduling strategies, however, do not apply 
to our problem, because they are designed for offline, batch 
processing of (semi-) static data, where the goal is to minimize 
total job completion time; in contrast, we focus on real-time 
processing of streaming data, where each individual result 
update must be delivered on time. 

Recently, much attention has shifted to real-time interactive
systems for big data analytics, such as Dremel [8],
Impala, Presto [9], OceanRT [10], [11], C-Cube [12], SADA [13]
and newer versions of Hive [14]. Such systems deal with static
rather than streaming data; meanwhile, the term “real-time”
here has a different meaning: that each query is executed
quickly enough so that the user can wait online for its results.
Hence, resource scheduling in these systems resembles that in
offline systems, and their techniques do not apply to our problem
for similar reasons. Another recent hot topic in cloud-based
system research is cloud-based stream processing, which is
most relevant to this work. We review such systems in Section II-C.

Finally, there exist generic scheduling solutions for provisioning
to multiple applications competing for cloud resources.
Systems such as Mesos [15] and YARN [16] are prominent examples.
Abacus [17] optimizes total utility by allocating resources
via a truth-revealing auction. These methods generally assume
that an application already knows the amount of resources it
needs, and how to distribute these resources internally, which
are the problems solved in this paper. Hence, they can be used
in combination with the proposed solution.


















B. Traditional DSMSs 

Stream processing has been an important research topic
in both academia and industry. Earlier work focuses on
DSMSs in a centralized setting, which resembles the traditional,
centralized database management systems. For instance,
STREAM [18] establishes formal semantics for queries over
streams [19], and proposes efficient query processing algorithms,
e.g., [20]. Similar systems include Aurora [21], Gigascope [22],
TelegraphCQ [23], and System S [24]. Scheduling
in such centralized systems means deciding the best order of
operators to execute (by the central processor), e.g., in order
to minimize memory consumption [25]. Hence, scheduling
strategies in these systems such as [25] do not apply to our
cloud-based setting, where operators are executed by multiple
nodes in parallel, and computational resources are dynamically
provisioned on demand.

Similarly, DSMSs built for traditional parallel settings,
notably Borealis [26], also differ from cloud-based DSMSs
in that the former assume that a fixed amount of computational
resources is available beforehand, rather than dynamically
allocated. Hence, to our knowledge, no scheduling technique
along this line of research applies to our problem. Next we
review cloud-based DSMSs.

C. Cloud-Based Stream Processing 

There are two general methodologies for processing
streams in a cloud: using an operator-based DSMS, and
discretizing stream inputs into mini-batches [27]. The former
derives from traditional DSMSs described in Section II-B,
whereas the latter reduces stream processing to batch execution,
explained in Section II-A. In general, mini-batch systems
are optimized for throughput, at the expense of increased query
response time, since each input must wait until a full batch is
formed. While it is possible to minimize this extra latency by
having extremely small batches, doing so would lead to high
overhead, defeating the purpose. We focus on operator-based
DSMSs since our target applications have real-time constraints,
in which response time is key.

Two popular open source operator-based DSMSs are
Storm [28] and S4 [29]. Their main difference is that Storm
guarantees the correctness of its results (e.g., through its
Trident component), while S4 does not. Both systems rely
on manual configurations for resource scheduling. Hence, to
avoid slow responses due to operator overloading, the user has
to either overprovision resources to every operator, which is
wasteful, or continuously tune the system, which is infeasible
for dynamic streams.

Many research prototypes of operator-based DSMSs have been
proposed, such as TimeStream [2], which features efficient
fault recovery, and Samza [30]. None of these systems, however,
addresses the resource scheduling problem. In the following
we present DRS, the first effective resource scheduler
for cloud-based operator DSMSs.

III. Dynamic Resource Scheduling

Section III-A clarifies assumptions in DRS. Section III-B
presents the DRS performance model, which estimates query
response time given a resource allocation scheme.
Section III-C describes the DRS dynamic resource scheduling
algorithm. Table I summarizes frequently used notations
throughout the paper.


TABLE I. Table of notations.

Symbol         Meaning
$N$            Total number of operators in an application
$\lambda_i$    Mean arrival rate of inputs to the $i$-th operator
$\lambda_0$    Mean arrival rate of inputs to the application
$\mu_i$        Mean processing rate of inputs to the $i$-th operator
$k_i$          Number of processors allocated to the $i$-th operator
$\mathbf{k}$   A vector $(k_1, \ldots, k_N)$ containing all $k_i$'s
$T_{max}$      Real-time constraint parameter: each input of the application
               is expected to be fully processed within $T_{max}$ time
$K_{max}$      Resource constraint parameter: maximum number of available
               processors that can be allocated to the operators
$t$            An input tuple to the streaming application
$T$            A random variable on the total sojourn time of a tuple $t$


A. Assumptions 

We focus on stream analytics applications, which are
usually memory-based and computation intensive. For such
applications, processors are the main type of resource, each
of which contains a CPU (or one of its cores) and a certain
amount of RAM. Disk space is not critical as streaming inputs
are computed on-the-fly. Although networking delay can also
affect query latency, we do not explicitly model it, because
(a) it is often correlated with computational costs and (b)
it can be affected by uncontrollable factors, such as other
transmission-heavy applications on the same server or in the
same subnetwork. Further, data centers today are increasingly
equipped with next-generation networking hardware that provides
significantly higher bandwidth and lower latency, such as 10G
Ethernet (e.g., in [31]) and InfiniBand (e.g., as argued in [32]),
whose prices have been dropping rapidly. In contrast, processor
speed in terms of CPU clock rate and RAM latency has
stagnated in the past few years. Hence, we assume processors
to be the bottleneck of the system, not network bandwidth.

For ease of presentation, we further assume that all
processors in the cloud have identical computational power.
Nevertheless, the proposed models and algorithms can also
support settings with heterogeneous processors, and we explain
how this is done whenever necessary. Meanwhile, we assume
that load balancing is achieved in every operator, i.e., each
processor inside the same operator performs a roughly equal
amount of work. How to achieve load balancing is an orthogonal
topic, and it is under active research, e.g., [33], [34].
Under these assumptions, the processing speed of an operator
depends mainly on the number of processors therein.

The goal of DRS is to fully process each input of the
application in real time. Specifically, an input tuple to the
application, e.g., a video frame in Figure 1, may lead to multiple
intermediate results, e.g., features extracted by operator A, and
objects recognized by operator B. We say an input tuple $t$ is
fully processed, if and only if every intermediate result derived
from $t$ has been processed by its corresponding operator. We
use the term total sojourn time to refer to the duration from the
time that $t$ first arrives at the system to the time that $t$ is fully
processed. Our goal is then to ensure that the expected total
sojourn time of each input $t$ is no more than a user-specified
duration, denoted by $T_{max}$.
























B. Performance Model 

Given an application’s operator network, e.g., the one in
Figure 2, the current resource allocation and characteristics
of the streaming data, the DRS performance model estimates
the total sojourn time of an average input of the application,
as defined at the end of the last subsection. The current resource
allocation is represented by the number of processors assigned
to each operator. Formally, we define $N$ as the number of operators
in an application, and a resource allocation is modeled by
a vector $\mathbf{k} = (k_1, k_2, \ldots, k_N)$, where $k_i$ ($1 \le i \le N$) corresponds
to the number of processors allocated to the $i$-th operator.

Regarding data characteristics, the important variables are
the rate at which tuples arrive at each operator, and how fast
they can be processed by one processor. Networking delay
is not explicitly expressed in our model, and we discuss this
issue further at the end of this subsection. Note that our
model assumes neither deterministic tuple arrival rates nor
deterministic processing times; in other words, instantaneous
arrival rates and processing times can fluctuate. On the other hand, in
order to make the problem tractable, we do assume that the
system remains in a relatively steady state during the span
in which DRS performs modeling and resource scheduling. This
means that the average tuple arrival rate and processing time
at each operator remain stable, and we obtain these quantities
through the measurement module of the system, described in
Section IV. Specifically, for the $i$-th operator ($1 \le i \le N$), we
use $\lambda_i$ to denote the mean arrival rate of its inputs, and $\mu_i$ to
denote the mean processing rate of each of its processors. For
instance, the case of $k_i = 3$, $\lambda_i = 10$ and $\mu_i = 3$ means that on
average, 10 tuples arrive at the $i$-th operator in unit time, and
each of its 3 processors processes 3 tuples in unit time. For
an operator with multiple input streams, i.e., a join operator, $\lambda_i$
is the total arrival rate of all its input streams, and $\mu_i$ is the
average processing rate of the operator, regardless of which
input stream the tuple comes from.

Additionally, we define $\lambda_0$ as the mean arrival rate of inputs
that flow into the application’s operator network from outside
of it. When there are clear “source” operators in the operator
network whose inputs come entirely from outside the network,
$\lambda_0$ is simply the total arrival rate of these sources. In general,
however, there may not be a simple relationship between $\lambda_0$
and the set of $\lambda_i$’s, $1 \le i \le N$. For example, in Figure 2, $\lambda_0$ is
the arrival rate of tuples that come (from outside the system)
to operator A; the input arrival rate $\lambda_A$ for A, on the other hand,
is the sum of $\lambda_0$ and the arrival rate of A’s other input stream,
produced by operator E.

We use the random variable $T$ to denote the total sojourn time
of an input to the application. Our goal is to estimate $E[T]$,
i.e., the expected value of $T$. The basic idea for estimating
$E[T]$ is to model the system as an open queueing network
(OQN) [35], and apply known results in queueing theory. In an
OQN, the total sojourn time of an input tuple $t$ is computed
by summing up its total service time (i.e., total time spent
on processing $t$ and intermediate results derived from $t$) and
total queueing delay (total time that $t$ and its derived tuples
wait in operator queues). This closely matches our setting.
The challenge, however, is that there are numerous OQN
models in the queueing literature, and selecting an appropriate
one is non-trivial. On one hand, complex queueing network
models generally do not have known solutions; among the
ones that do, most have only numerical solutions (rather than
analytical ones), which renders effective optimization hard; on
the other hand, an overly simplified model may rely on strong
assumptions, such as deterministic tuple arrival rates, which
do not hold in our setting. After comparing various options
and testing them through experiments, we chose to build our
model based on a combination of one of Erlang’s models [36],
[37] and the Jackson network [35], [38]. The former enables
effective analysis of each individual operator, and the latter
helps to aggregate these analyses to estimate $E[T]$ for the
whole network. Our model has an analytical solution, and it
involves only mild limitations, which will be discussed shortly.


We first focus on a single operator, say the $i$-th. We use
$T_i$ to denote the time between the arrival of an input at the
operator and the time when the operator finishes processing it.
We model the operator as an M/M/$k_i$ system [37], where $k_i$
is the number of processors for operator $i$. According to the
Erlang formula, $E[T_i]$ is calculated by:




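$$E[T_i](k_i) = \begin{cases}
\dfrac{(\lambda_i/\mu_i)^{k_i}\,\pi_0}{k_i!\,\left(1-\frac{\lambda_i}{\mu_i k_i}\right)^{2}\mu_i k_i} + \dfrac{1}{\mu_i}, & \text{for } k_i > \dfrac{\lambda_i}{\mu_i}, \\[2ex]
+\infty, & \text{for } k_i \le \dfrac{\lambda_i}{\mu_i},
\end{cases} \tag{1}$$

where $\pi_0$ is a normalization term, given by:

$$\pi_0 = \left[\sum_{l=0}^{k_i-1}\frac{(\lambda_i/\mu_i)^{l}}{l!} + \frac{(\lambda_i/\mu_i)^{k_i}}{k_i!\,\left(1-\frac{\lambda_i}{\mu_i k_i}\right)}\right]^{-1}. \tag{2}$$

Intuitively, since new tuples arrive at an average rate $\lambda_i$,
and each processor processes tuples at an average rate $\mu_i$,
when $k_i \le \lambda_i/\mu_i$, the processors cannot keep up with incoming
tuples. Consequently, the number of tuples in the operator
queue increases with time, leading to infinite queueing delay.
When $k_i > \lambda_i/\mu_i$, tuples are expected to be handled faster than
they arrive. However, due to the randomness of the arrival and
processing rates, the queue may still grow when the arrival rate
is temporarily higher than the processing rate. Clearly, the
expected service time for each tuple is $1/\mu_i$. The expected queueing
delay is captured by the more complicated term in Equation (1).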
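The single-operator estimate can be evaluated directly. The following is a minimal Python sketch of the standard M/M/k sojourn-time formula that Equation (1) is based on; the function name `mmk_sojourn_time` and the sample rates are illustrative assumptions rather than part of DRS.

```python
from math import factorial, inf


def mmk_sojourn_time(lam: float, mu: float, k: int) -> float:
    """Expected sojourn time of one operator modeled as an M/M/k queue.

    lam: mean arrival rate of the operator (lambda_i)
    mu:  mean processing rate of a single processor (mu_i)
    k:   number of processors allocated to the operator (k_i)
    """
    if lam < 0 or mu <= 0 or k < 1:
        raise ValueError("require lam >= 0, mu > 0 and k >= 1")
    load = lam / mu                      # offered load lambda_i / mu_i
    if k <= load:
        return inf                       # unstable: the queue grows without bound
    rho = load / k                       # per-processor utilization, < 1 here
    # Normalization term pi_0 of the M/M/k steady-state distribution.
    pi0 = 1.0 / (
        sum(load ** l / factorial(l) for l in range(k))
        + load ** k / (factorial(k) * (1.0 - rho))
    )
    # Expected queueing delay plus expected service time 1/mu.
    queueing = load ** k * pi0 / (factorial(k) * (1.0 - rho) ** 2 * mu * k)
    return queueing + 1.0 / mu


if __name__ == "__main__":
    # Hypothetical operator: 10 tuples arrive per unit time, 3 tuples per processor.
    for k in range(3, 8):
        print(k, mmk_sojourn_time(10.0, 3.0, k))
```

With a single processor the function reduces to the familiar M/M/1 value $1/(\mu_i - \lambda_i)$, which gives a quick way to check an implementation.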

Next we aggregate all $E[T_i]$’s to obtain an estimate of $E[T]$
for the entire operator network. According to the theory of
Jackson networks [35], [38], $E[T]$ is computed by a weighted
average of the $E[T_i]$’s:

$$E[T](\mathbf{k}) = E[T](k_1, k_2, \ldots, k_N) = \frac{1}{\lambda_0}\sum_{i=1}^{N}\lambda_i\,E[T_i](k_i). \tag{3}$$
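As a small worked instance of this aggregation, consider a hypothetical two-operator chain in the style of Figure 1. All rates below are assumed purely for illustration and are not measurements from the paper. Each operator has a single processor, so the per-operator estimate reduces to the M/M/1 value $1/(\mu_i - \lambda_i)$:

```latex
% Assumed rates: every frame yields two features on average.
\lambda_0 = \lambda_A = 10, \quad \lambda_B = 20, \quad
\mu_A = 15, \quad \mu_B = 30, \quad k_A = k_B = 1

E[T_A](1) = \frac{1}{\mu_A - \lambda_A} = \frac{1}{5} = 0.2, \qquad
E[T_B](1) = \frac{1}{\mu_B - \lambda_B} = \frac{1}{10} = 0.1

E[T] = \frac{1}{\lambda_0}\bigl(\lambda_A E[T_A] + \lambda_B E[T_B]\bigr)
     = \frac{10 \cdot 0.2 + 20 \cdot 0.1}{10} = 0.4
```

The result agrees with the definition of total sojourn time: a frame spends 0.2 time units at A on average, and each of its two derived features spends 0.1 time units at B.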


This completes the DRS performance model. Since our
model relies on Erlang’s formula and the Jackson open queueing
network, it inherits two limitations. First, the model implicitly
assumes that both the inter-arrival times of external tuples (that
come from outside the system) and the service times of each
operator are i.i.d. samples from random variables following
the exponential distribution. Second, the Jackson network does not
explicitly model pipelining between different operators. Hence,
our model may give an inaccurate estimate of $E[T]$ when the
service time or tuple arrival distribution deviates significantly
from the expected exponential distribution, or when pipelining
affects total processing time considerably. In the meantime, our
model does not explicitly consider networking costs, due to the
fact that measuring the networking delay between two nodes
requires complex inter-node protocols, e.g., for clock synchronization,
which can be prohibitively expensive in a real-time
application. Therefore, when networking delay becomes a
dominant factor in the total sojourn time of an average input,
our model tends to underestimate the true
result. Nevertheless, as we show in the experiments, the value
of $E[T]$ predicted by our model is sufficiently accurate when
the underlying application is computation intensive, which is
one important assumption made in Section III-A. Further, even
when the prediction is inaccurate, it is still strongly correlated
with the exact value of $E[T]$, meaning that DRS remains
capable of identifying the best resource allocation with the
predicted value. In the rest of the section, we show how DRS
schedules resources based on the performance model.


C. Scheduling Algorithm 

In a nutshell, DRS (a) monitors the current performance of
the system (more details in Section IV), (b) checks whether
the performance fails (or is about to fail) to meet the real-time
constraint, or whether the system can fulfil the constraint
with fewer resources, and (c) reschedules resources when (b)
returns a positive result. The main challenge lies in (b), which
needs to answer two questions: how many processors
are needed to fulfil the real-time requirement, and where to
place them in the operator network. We first focus on the
latter question. Specifically, given a number (say, $K_{max}$) of
processors, we are to find an optimal assignment of these
processors to the operators of the application that obtains the
minimum expected total sojourn time. The problem can be
mathematically formalized as follows:

$$\min_{\mathbf{k}} \; E[T](\mathbf{k}) \quad \text{s.t.} \quad \sum_{i=1}^{N} k_i \le K_{max}, \quad k_i \text{ is an integer}, \; i = 1, 2, \ldots, N. \tag{4}$$

A naive solution to the above optimization problem is
to view it as an integer program, and apply a standard
solver. However, current integer programming solvers are
prohibitively slow, especially considering that DRS itself has to
run in real time. In the following we describe a novel algorithm
that solves Program (4) with negligible cost.

The key property used in the proposed algorithm is that
$E[T_i](k_i)$, defined in Equation (1), is a convex function of $k_i$,
the number of processors assigned to the $i$-th operator. This
property has already been established in prior work. It follows from
the convexity of $E[T_i](k_i)$ that the marginal benefit of
incrementing $k_i$ decreases monotonically as $k_i$ becomes larger.
Formally, for all $k_i' \ge k_i$, we have:

$$E[T_i](k_i) - E[T_i](k_i + 1) \;\ge\; E[T_i](k_i') - E[T_i](k_i' + 1). \tag{5}$$

Now observe from Equation (3) that $E[T]$ is a weighted
sum of the $E[T_i]$’s, and each weight $\lambda_i$ is independent of the
value of $k_i$. Hence, $E[T]$ is also a convex function of the
$k_i$’s, meaning that incrementing each $k_i$ also has diminishing
marginal benefit with respect to $E[T]$. Based on this observation,
we design a greedy algorithm, listed in Algorithm 1.
The idea is to start from the smallest possible value of each $k_i$
(lines 1-4) and iteratively add one processor to the operator that
leads to the largest decrease in $E[T]$ (lines 8-15). According to
Equation (1), each $k_i$ must be larger than $\lambda_i/\mu_i$, since otherwise
$E[T_i](k_i)$ becomes infinitely large, leading to an infinite $E[T]$
as well.


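Algorithm 1 AssignProcessors

Input: $K_{max}$, $\lambda_0$, $\{\lambda_i, i = 1, \ldots, N\}$, $\{\mu_i, i = 1, \ldots, N\}$.
Output: $\mathbf{k} = (k_1, k_2, \ldots, k_N)$

 1: for all $i \leftarrow 1, \ldots, N$ do
 2:    /* Initialize each $k_i$ */
 3:    $k_i \leftarrow \lfloor \lambda_i / \mu_i \rfloor + 1$
 4: end for
 5: if $\sum_{i=1}^{N} k_i > K_{max}$ then
 6:    throw an exception saying that the number of processors
       is not sufficient for the application.
 7: end if
 8: while $\sum_{i=1}^{N} k_i < K_{max}$ do
 9:    for all $i \leftarrow 1, \ldots, N$ do
10:       $\delta_i \leftarrow \lambda_i \cdot \left(E[T_i](k_i) - E[T_i](k_i + 1)\right)$
11:    end for
12:    $j \leftarrow \arg\max_i \delta_i$   /* find the operator with the largest marginal benefit. */
13:    $k_j \leftarrow k_j + 1$
14: end while
15: return $\mathbf{k} = (k_1, k_2, \ldots, k_N)$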
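For concreteness, the greedy procedure can be sketched in Python as follows. This is a minimal illustration under the M/M/k model of Equation (1), not the DRS implementation; the names `sojourn_time` and `assign_processors` and the sample rates are assumptions. The input $\lambda_0$ is omitted because it only scales every marginal benefit by the same constant and therefore does not affect the argmax.

```python
from math import factorial, floor, inf


def sojourn_time(lam: float, mu: float, k: int) -> float:
    """Expected M/M/k sojourn time of one operator (standard Erlang result)."""
    load = lam / mu
    if k <= load:
        return inf
    rho = load / k
    pi0 = 1.0 / (sum(load ** l / factorial(l) for l in range(k))
                 + load ** k / (factorial(k) * (1.0 - rho)))
    return load ** k * pi0 / (factorial(k) * (1.0 - rho) ** 2 * mu * k) + 1.0 / mu


def assign_processors(k_max: int, lams: list, mus: list) -> list:
    """Greedy assignment of at most k_max processors (sketch of Algorithm 1)."""
    # Smallest stable allocation: each k_i must exceed lambda_i / mu_i.
    ks = [floor(lam / mu) + 1 for lam, mu in zip(lams, mus)]
    if sum(ks) > k_max:
        raise ValueError("not enough processors for the application")
    while sum(ks) < k_max:
        # Marginal benefit of giving one more processor to each operator.
        gains = [lam * (sojourn_time(lam, mu, k) - sojourn_time(lam, mu, k + 1))
                 for lam, mu, k in zip(lams, mus, ks)]
        best = max(range(len(ks)), key=gains.__getitem__)
        ks[best] += 1
    return ks


if __name__ == "__main__":
    # Hypothetical two-operator application with a budget of 12 processors.
    print(assign_processors(12, [10.0, 20.0], [3.0, 4.0]))
```

Each iteration evaluates the model once per operator, so the whole run costs on the order of $N \cdot K_{max}$ model evaluations, which is consistent with the requirement that the scheduler itself runs in real time.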


Since $E[T]$ is convex, the above greedy algorithm always
finds the optimal solution, similar to the case of the server
reallocation problem studied in prior work. This is restated as follows:

Theorem 1: Algorithm 1 always returns an exact optimal
solution to Program (4).

The proof is given in the appendix.


Next we focus on the question of how to determine the
minimum number of processors that are expected to achieve
real-time processing, i.e., such that the expected total sojourn time $E[T]$
is no larger than a user-defined threshold $T_{max}$. This can be
modeled with the following optimization problem:

$$\min_{\mathbf{k}} \; \sum_{i=1}^{N} k_i \quad \text{s.t.} \quad E[T](\mathbf{k}) \le T_{max}, \quad k_i \text{ is an integer}, \; i = 1, 2, \ldots, N. \tag{6}$$

Similar to Program (4), both the constraints and the objective
of Program (6) are convex in terms of $\mathbf{k}$. Hence, we solve
Program (6) with a greedy strategy similar to Algorithm 1.
Specifically, we start by initializing each $k_i$ with its minimal
requirement, as in lines 1-4 of Algorithm 1. The algorithm then
repeatedly adds one processor to the operator with the maximum
marginal benefit, as in lines 8-15 of Algorithm 1, until $E[T]$ is
no larger than $T_{max}$. We omit the proof of correctness for this
algorithm, since it is nearly identical to that of Algorithm 1.
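The greedy strategy for Program (6) can be sketched along the same lines. The Python code below is an illustration under the same M/M/k model and is not taken from the DRS code base; the names `expected_total_sojourn` and `min_processors` and the sample rates are assumptions. The sketch also adds a feasibility check that the text does not spell out: as processors are added, $E[T]$ approaches but never goes below the pure service time $\frac{1}{\lambda_0}\sum_i \lambda_i/\mu_i$, so a threshold at or below that value can never be met.

```python
from math import factorial, floor, inf


def sojourn_time(lam: float, mu: float, k: int) -> float:
    """Expected M/M/k sojourn time of one operator (standard Erlang result)."""
    load = lam / mu
    if k <= load:
        return inf
    rho = load / k
    pi0 = 1.0 / (sum(load ** l / factorial(l) for l in range(k))
                 + load ** k / (factorial(k) * (1.0 - rho)))
    return load ** k * pi0 / (factorial(k) * (1.0 - rho) ** 2 * mu * k) + 1.0 / mu


def expected_total_sojourn(lam0: float, lams: list, mus: list, ks: list) -> float:
    """Network-wide estimate: arrival-rate-weighted sum of per-operator times."""
    return sum(lam * sojourn_time(lam, mu, k)
               for lam, mu, k in zip(lams, mus, ks)) / lam0


def min_processors(t_max: float, lam0: float, lams: list, mus: list) -> list:
    """Greedily grow the allocation until the estimate drops to t_max or below."""
    service_floor = sum(lam / mu for lam, mu in zip(lams, mus)) / lam0
    if t_max <= service_floor:
        raise ValueError("t_max cannot be met with any number of processors")
    ks = [floor(lam / mu) + 1 for lam, mu in zip(lams, mus)]
    while expected_total_sojourn(lam0, lams, mus, ks) > t_max:
        gains = [lam * (sojourn_time(lam, mu, k) - sojourn_time(lam, mu, k + 1))
                 for lam, mu, k in zip(lams, mus, ks)]
        best = max(range(len(ks)), key=gains.__getitem__)
        ks[best] += 1
    return ks


if __name__ == "__main__":
    # Hypothetical application: external rate 10, target of one time unit.
    print(min_processors(1.0, 10.0, [10.0, 20.0], [3.0, 4.0]))
```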

In practice, the solution of Program (6) may not give us
the precise amount of resources necessary for meeting the
real-time requirement at all times, for two reasons. First, the
total sojourn time can be different for every input, and $E[T]$
is merely its expected value. Second, the performance model
described in Section III-B outputs only an estimate of $E[T]$,
rather than its precise value. To address this problem, DRS
starts with the number of processors suggested by the solution
of Program (6), monitors the actual total sojourn time,
and continuously adjusts the number of processors based on
the measured value of $E[T]$. In the next section, we discuss the
system design and implementation issues with DRS.


IV. System Design 


An overview of the system architecture is presented in
Figure 3. The system consists of two layers, the DRS layer
and the CSP (cloud-based stream processing) layer. Specifically,
the DRS layer is responsible for performance measurement,
resource scheduling and resource allocation control, while the CSP
layer contains the primitive stream processing logic, e.g.,
running instances of Storm [28] and S4 [29], and a cloud-based
resource pool service, e.g., YARN [16] and Amazon EC2.



Fig. 3. The architecture overview 

While the core of the DRS layer is responsible for the optimization
of resource scheduling based on the model derived in
the previous section, the system to support such functionality
is not straightforward to build. Given the heterogeneous
underlying infrastructure and the complicated stream processing
applications running on the CSP layer, it is crucial to
collect accurate metrics from the infrastructure, aggregate
the statistics, make online decisions and control the resource
allocation in an efficient manner.

To seamlessly combine the optimization model and the
concrete stream processing system, we build a number of
independent functional modules, which bridge the gap between
the physical infrastructure and the abstract performance model.
As shown in Figure 3, on the input side of the optimizer
component, we have the measurer module and the configuration
reader module, which generate the statistics needed by the
optimizer based on the data/control flow from the CSP layer. On
the output end of the workflow diagram, the scheduler module
and the resource negotiator module transform the decisions of the
optimizer into executable commands for different stream
processing platforms and resource pools. The technical details
and key features of the modules are discussed in Appendix B.


V. Empirical Studies 

To test the effectiveness of DRS, we have implemented 
it^3 and integrated it into Storm [28], which provides the 
underlying CSP layer. An overview of the important concepts 
and architectural aspects of Storm, and a description of how 
we implement the measurer, scheduler and resource negotiator 
modules of DRS in Storm, are provided in Appendix C. 

^3 The source code is available online: https://github.com/ADSC-Cloud/resa/ 

A. Testing Applications 

We implement two real-time stream analytics applications 
from different domains: video logo detection (VLD) and 
frequent pattern detection (FPD). 

Logo Detection from a Video Stream. Given a set of query 
logo images, the logo detection application identifies these 
images in the input video stream. Although much work has 
been done to improve the accuracy and efficiency of VLD, 
performing it in real time remains a major challenge, due to 
its high computational complexity. 

Fig. 4. The topology of the real-time video logo detection application. 

Figure 4 illustrates the topology of the real-time VLD 
application, which is a chain of operators containing a spout, 
a feature extractor, a feature matcher, and an aggregator. The 
spout extracts frames from the raw video stream. The output 
rate of frames may vary from time to time due to the generation 
algorithm and the original video contents. We employ the 
scale-invariant feature transform (SIFT) algorithm to extract 
features from each frame. This step is time-consuming, 
involving convolutions on the 2-dimensional image space. Moreover, 
the number of resulting SIFT features may vary dramatically 
across frames, causing significant variance in the 
computation overhead over time. The feature matcher measures the $L_2$ 
distance between its input SIFT features and the pre-generated 
logo features, and outputs matching pairs with distance lower 
than a pre-defined threshold. Finally, the aggregator judges 
whether a logo appears in a video frame by aggregating all 
input matching feature pairs, i.e., if the number of matched 
features in a video frame exceeds a threshold, the logo is 
considered to appear in the frame. 

Frequent Pattern Detection over a Microblog Stream. This 
application maintains the frequent patterns on a sliding 
window over a microblog stream from Twitter. For each input 
sentence, we append an additional label “+/-”, indicating whether it is 
entering or leaving the dedicated window. Given a set of input 
item groups in the sliding window and a threshold, we define 
a maximal frequent pattern (MFP) to be an itemset satisfying: 
(a) the number of item groups containing this itemset, called 
its occurrence count, is above the threshold; and (b) the 
occurrence count of each of its supersets is below the threshold. 

Fig. 5. The topology of the stream frequent pattern detection application. 

Figure 5 illustrates the operator topology. There are two 
spouts, which generate an event tuple as an itemset 
enters/leaves the current processing window, respectively. The 
pattern generator generates candidate patterns, i.e., itemsets. 
These candidates include an exponential number of possible 
non-empty combinations of items. Hence, the computation cost 
of the pattern generator varies with the number of items in 
recent transactions. 

The detector maintains the state records containing (a) the 
occurrence counts and (b) the MFP indicator of all the candidate 
itemsets. When a state change happens to some itemset, e.g., 
from MFP to non-MFP, the detector outputs a notification to 
the reporter, and also to itself through the loop-back link. Since 
(a) each processor in the detector maintains only a portion 
of the state records; and (b) a state change can affect the 
states of other itemsets stored at a different processor, the 
loop ensures that the state change notifications are sent to all 
the instances. Finally, the reporter presents the updates of the 
detection results to the user. In our implementation, the reporter 
simply writes its inputs to an HDFS file. 
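
The MFP definition above can be illustrated with a brute-force sketch over a small window. This is not the incremental, distributed detector described in the text. The function name and the toy window are hypothetical, and we assume "above the threshold" means a count of at least the threshold.

```python
from itertools import combinations

def maximal_frequent_patterns(item_groups, threshold):
    """Return the maximal frequent itemsets of a window by brute force."""
    groups = [frozenset(g) for g in item_groups]
    # Candidate itemsets: every non-empty combination of items of each group
    # (exponential in the group size, as noted for the pattern generator).
    candidates = set()
    for g in groups:
        for size in range(1, len(g) + 1):
            candidates.update(frozenset(c) for c in combinations(sorted(g), size))
    # Occurrence count = number of item groups containing the itemset.
    count = {c: sum(1 for g in groups if c <= g) for c in candidates}
    # Assumption: "above the threshold" is read as count >= threshold.
    frequent = {c for c in candidates if count[c] >= threshold}
    # Maximal: no proper superset is itself frequent.
    return {c for c in frequent if not any(c < other for other in frequent)}

# Hypothetical window of four item groups.
window = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
mfps = maximal_frequent_patterns(window, threshold=2)
```

In this toy window, {a, b} and {a, c} each occur twice and have no frequent superset, so they are the two MFPs; the singletons are frequent but not maximal.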

B. Experiment Setup 

The experiments were run on a cluster of 6 Ubuntu Linux 
machines interconnected by a LAN switch. Each machine 
is equipped with an Intel quad-core CPU 3.4GHz and 8GB 
of RAM. Following common configurations of Storm, we 
allocated one machine to host the Nimbus and the Zookeeper 
Server; the remaining 5 machines host executors for the 
experimental applications. We also configured each of these 
5 machines to host at most 5 executors. 
The main purpose of this constraint is to mitigate the 
interference caused by other executors running on the same 
machine, and the resource contention due to the over-allocation 
of executors on a single machine. As a result, there are 25 
executors in total. 


For both applications, namely video logo detection (VLD) 
and frequent pattern detection (FPD), we allocated two 
executors as spouts, and one executor for DRS. The remaining 
25 - 3 = 22 executors are used as bolts, i.e., $K_{\max} = 22$. For 
VLD, the input data are a series of video clips of soccer 
games, and we selected 16 logos as the detection targets. The 
frame rate simulates a typical Internet video experience; it 
is uniformly distributed in the interval [1, 25] with a mean of 
13 frames/second. For FPD, we use a real dataset containing 
28,688,584 tweets from 2,168,939 users collected from Oct. 
2006 to Nov. 2009. We set the sliding window to 50,000 
tweets, and simulated the arrival of tweets to the topology 
following a Poisson process with an average arrival rate of 
320 tweets per second. 

C. Experimental Results 

For both applications, we run two sets of experiments: 
(a) with re-balancing disabled, i.e., we keep DRS running 
passively, meaning that it continues to monitor the system 
performance and recommend new (if better) resource 
allocation configurations, but does not perform re-scheduling; (b) 
with re-balancing disabled at the beginning, and then enabled 
at a later time. These experiments aim to test the quality of 
the performance model and evaluate the effectiveness of the 
resource scheduling algorithm of DRS. 

Experiments with re-balancing disabled. In this set, each 
experiment lasts for 10 minutes. Figure 6 shows the mean and 
standard deviation of the total sojourn times under 6 different 
allocations for each application. The x-axis label $(x_1 : x_2 : x_3)$ 
denotes a resource configuration, where $x_1$, $x_2$, $x_3$ (in this 
order) are the numbers of executors allocated to 
the operators SIFT feature extractor, feature matcher, and 
matching aggregator in Figure 4, or the pattern generator, 
detector, and reporter in Figure 5. The two configurations 
(10:11:1) for VLD and (6:13:3) for FPD are the 
allocations recommended by the passively running DRS. 

^Re-balancing is a term used by Storm; it has the same meaning as re-scheduling. 

Fig. 6. The mean and standard deviation of the complete sojourn times 
under different resource configurations with re-balancing disabled, where the 
marked configurations are the allocations recommended by the passively 
running DRS. 

From Figure 6, we make the following observations. The 
resource configurations (10:11:1) for VLD and (6:13:3) for 
FPD achieve the best performance according to the measured 
average sojourn time. This is consistent with 
the recommendations provided by the passively running DRS, 
which validates the accuracy and effectiveness of the DRS 
performance model and resource scheduling algorithm. 

In particular, these two configurations obtain not only 
the smallest average sojourn times, but also the minimum 
standard deviation, meaning that these two allocations lead to 
the smallest performance oscillations. The other configurations, 
i.e., the 5 closest ones in terms of the $L_1$ distance 
to the best configurations 
(10:11:1) for VLD and (6:13:3) for FPD, all exhibit 
considerably worse performance. These results demonstrate that it is 
not trivial to find the optimal resource allocation, especially 
when the application topology becomes more complicated 
(e.g., more than three bolt operators), and hence reveal the 
importance and usefulness of DRS. 

To take a closer look at how DRS arrives at correct resource 
configuration recommendations, Figure 7 shows the 
relationship between the measured average sojourn times and 
the estimated average sojourn times, which are derived by the 
performance model described in Section III-B, for the six 
resource allocation configurations of both VLD and FPD, with 
re-balancing disabled. 

Fig. 7. Comparing the average sojourn time estimated by the model and measured 
in the experiment. 

As shown in Figure 7, the measured and the estimated average 
sojourn times exhibit a strictly monotonic relationship, which 
signifies that the performance model is capable of identifying 
the best resource allocation configuration. 
Moreover, the performance model outputs accurate estimates 
for VLD, albeit with slight underestimation compared 
to the measured values, which is expected, as our model does 
not consider network overhead. It is worth noting that the 
estimates are accurate even though the underlying conditions 
of the Jackson network theory and the Erlang 
model are not satisfied. For example, the frame rate is uniformly (rather than 
exponentially, as required) distributed. Meanwhile, the operator 
input queues do not follow a strict FIFO rule; instead, tuples 
are hashed to processors. Different operators also run in 
parallel, which leads to pipelining. The model is clearly robust 
to these deviations from its assumptions. 

For FPD, the estimated sojourn times show larger 
deviations from the measured ones. This is mainly because the 
model does not consider network transmission cost, which 
takes a dominant portion of the total query latency in this 
particular application. In other words, FPD is in effect a 
data-intensive application rather than the computation-intensive 
type that we focus on. Nevertheless, our model still 
correctly indicates the relative order of the performance of 
different resource allocation configurations. Meanwhile, since 
the estimates are strongly correlated with the true values, 
a polynomial regression can be used straightforwardly to 
predict the true latency value from the 
estimated one. 

To further validate the above explanation, we carried out 
a separate experiment over a synthetic topology with a simple 
chain of three operators. Each operator simply performs some 
computations (such as empty for-loops) with varying load (e.g., 
number of loops). We used 30 executors running on 6 physical 
machines, connected in the same subnetwork. The results are 
reported in Figure 8. 

Fig. 8. The degree of underestimation (the ratio of the measured to the 
estimated average sojourn time) vs. the total CPU time of the three bolts of 
the synthetic chain topology. 

As shown in Figure 8, we tried 6 different workloads in 
terms of the total CPU time (excluding the queue time) of the 
three bolts, from 0.567 milliseconds to 309.1 milliseconds 
(x-axis, log scale); the y-axis shows the ratio of the measured 
average sojourn time to the estimated value. The figure shows a clear 
decreasing trend of the degree of underestimation 
as the total 
CPU time of the three bolts increases. 


Experiments with re-balancing disabled first and then 
enabled. In this set of experiments, we investigate the 
performance of the actual re-scheduling operation activated and 
executed by DRS when it detects a non-optimal resource 
allocation configuration. Each experiment lasts for 27 
minutes; the re-balancing function is disabled from the 
beginning until the end of the 13th minute, and is enabled 
afterwards. In this way, we obtain a clear view of 
the performance (in terms of the average sojourn time) across 
the re-scheduling events. 



Fig. 9. The average sojourn times over the experiment time (in minutes) for 
three different initial allocations of each application, where the re-balancing 
function is disabled from the beginning until the end of the 13th minute, and 
remains enabled from the 14th minute on. 

Figure 9 shows three curves for each of the applications, 
where each curve represents an initial allocation. For 
both applications, the two curves that start with non-optimal 
allocations experience a re-scheduling event in the 14th 
minute, while the one with the optimal allocation as its initial 
state does not. From Figure 9, we can see that the optimizer 
triggers the re-scheduling action as early as possible, i.e., it 
responds quickly to a sub-optimal resource scheduling 
plan. After the re-scheduling, all the curves, regardless of their 
initial allocations, are scheduled with the same optimal 
solution. This statement is supported by two facts: (a) in 
Figure 9, after the 14th minute, all three curves have 
similar average sojourn times and similar performance trends, 
and the two curves that experience the re-allocation 
event show a clear decrease in the average sojourn time; (b) 
the plans kept in the log files further verify this observation. 

Another observation we make, based on the four curves 
experiencing re-scheduling events in Figure 9, namely 
(8:12:2) and (11:9:2) of VLD and (8:12:2) and (7:13:2) of 
FPD, is that our improved version of the re-balancing mechanism 
incurs remarkably low cost, i.e., a negligible increase in the 
average sojourn time, confined to the 14th minute. Besides, 
our whole re-balancing process takes only a few seconds, 
compared to the 1-2 minutes taken by Storm’s default version. 

Next, we investigate how DRS adjusts resources when it 
detects a resource shortage/wastage according to the 
configured parameter $T_{\max}$. Two experiments on the VLD application 
are conducted. Each one lasts for 27 minutes, with the 
re-balancing function disabled from the beginning until the end of 
the 13th minute and enabled afterwards. The average 
tuple complete sojourn time of the two experiments in each 
minute is plotted in Figure 10. In particular, for “ExpA”, we 
set $T_{\max} = 500$ ms and initially allocate 4 workers with 
$K_{\max} = 17$; for “ExpB”, we set $T_{\max} = 1000$ ms 
and initially allocate 5 workers with $K_{\max} = 22$. 

Fig. 10. The average sojourn times under two configurations of the VLD 
application, where the re-balancing function is disabled from the beginning until 
the end of the 13th minute, and remains enabled from the 14th 
minute on. For “ExpA”, we set $T_{\max} = 500$ ms and initially allocate 4 workers 
with $K_{\max} = 17$ executors; for “ExpB”, we set $T_{\max} = 1000$ ms 
and initially allocate 5 workers with $K_{\max} = 22$ executors. 


As shown in Figure 10, the curve of “ExpA” stays with the 
allocation configuration (8:8:1), which is the allocation 
suggested by the DRS algorithm when $K_{\max} = 17$, for the first 13 
minutes. During this period, its average tuple complete 
sojourn time exceeds the configured $T_{\max} = 500$ ms. In the 14th 
minute, right after the re-balancing function is enabled, 
DRS quickly triggers the re-scheduling operation, which includes: 
(a) initializing and adding an extra machine (thus 5 more 
executors); and (b) calculating the recommended allocation 
configuration (10:11:1) for the new $K_{\max} = 22$. From then on, 
the curve of “ExpA” stays stably below the target 
of $T_{\max} = 500$ ms. The curve of “ExpB”, on the other hand, 
shows the opposite shape, as expected: it initially stays with 
the configuration 
(10:11:1) until the end of the 13th minute. Afterwards, DRS 
triggers the re-balancing operation and makes “ExpB” use 
fewer resources, i.e., 4 machines, $K_{\max} = 17$ and (8:8:1), while still 
satisfying the performance requirement $T_{\max} = 1000$ ms. 

Similar to the observations we made on Figure 9, the 
cost incurred by our improved version of the re-balancing 
mechanism in “ExpA” and “ExpB” is again much lower than 
that of Storm’s default version, as demonstrated by Figure 10. 
In particular, “ExpB” experiences an increase to only about 1113 
ms in average sojourn time in the 14th minute, whereas the 
overhead of “ExpA” is larger, an increase to around 4777 ms. 
This is mainly because of the different actions taken during the 
re-scheduling: in “ExpA”, new machines are initialized and 
added to the running topology, in which case reusing JVMs 
has no effect; in contrast, “ExpB” only needs to stop and 
remove some existing working machines. Therefore, there is 
still room for improvement in our version of the re-balancing 
mechanism, which we leave as future work. 

The running overhead of DRS. To evaluate the 
computation overhead of the overall DRS layer, we report the 
CPU time spent by the whole DRS module, including 
processing the measurement results and calculating the optimal 
allocation. In this experiment, we only test on the video logo 
detection topology composed of three bolt operators, with all 
the parameters $\lambda_i$ and $\mu_i$, $i = 1, 2, 3$, fixed. We try different 
values of $K_{\max}$, i.e., the total number of executors for all operators. For 
each value of $K_{\max}$, we run the procedure 100,000 times and 
report the average running time of the whole DRS layer. The 
results are listed in Table II, with Scheduling as the allocation 
computation and Measurement as the metric processing 
computation. 

TABLE II. Computation overheads in milliseconds under 
different $K_{\max}$. 

K_max          12       24       48       96       192 
Scheduling     0.083    0.158    0.323    0.665    1.250 
Measurement    0.100    0.100    0.100    0.100    0.100 

Generally speaking, the computation done by DRS 
is almost negligible, with sub-millisecond overhead in 
most of the cases. Moreover, the results are consistent with our 
intuition that the computation cost is linear in $K_{\max}$, as 
analyzed for Algorithm 1. The time spent processing 
the measurement results is independent of $K_{\max}$. Instead, it 
depends on the total number of tasks of the topology, which, as 
discussed in Appendix C, remains fixed 
while the topology is running. 
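
The linear trend can be checked directly from the Scheduling row of Table II. The following is a small sketch of that arithmetic; the variable names are ours, and the numbers are copied from the table.

```python
# Scheduling times (ms) copied from Table II, keyed by K_max.
scheduling_ms = {12: 0.083, 24: 0.158, 48: 0.323, 96: 0.665, 192: 1.250}

ks = sorted(scheduling_ms)
# Ratio of consecutive times: a value near 2 means the cost doubles
# whenever K_max doubles, i.e., roughly linear growth.
ratios = [scheduling_ms[b] / scheduling_ms[a] for a, b in zip(ks, ks[1:])]
# Cost per executor in microseconds; roughly constant under linear growth.
per_executor_us = {k: 1000 * t / k for k, t in scheduling_ms.items()}
```

Each doubling of K_max multiplies the scheduling time by a factor between about 1.9 and 2.1, and the per-executor cost stays between roughly 6.5 and 7 microseconds.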






















VI. Conclusion 

This paper proposes DRS, a novel dynamic resource 
scheduler for real-time streaming analytics in a cloud-based DSMS. 
DRS overcomes several fundamental challenges, including the 
estimation of the resources necessary for satisfying 
real-time requirements, effective and efficient resource 
provisioning and scheduling, and the efficient implementation of 
such a scheduler in a cloud-based DSMS. The performance 
model of DRS is based on rigorous queueing theory, and it 
demonstrates robust performance even when the underlying 
conditions of the theory are not fully satisfied. In addition, 
we have integrated DRS into Storm, a popular system, and 
evaluated it by conducting extensive experiments based on real 
applications and datasets. 

Regarding future work, we plan to investigate efficient 
strategies for migrating the system from the current resource 
configuration to the new one recommended by DRS. This 
step should minimize additional overhead and result latency 
during migration, as well as the migration duration, (e.g., HU). 
Another interesting direction for future work is to investigate 
the possibility of improving performance model accuracy with 
more sophisticated queueing theory. 

Appendix A 
Proof of Theorem 1 

Proof (sketch): Let $\mathbf{k}$ be the output of AssignProcessors 
shown in Algorithm 1, and $\mathbf{k}^*$ be an optimal assignment that 
minimizes $E[T]$. Choose any two operators $x$ and $y$ satisfying 
$k^*_x > k_x$ and $k^*_y < k_y$. According to the facts that (a) 
AssignProcessors always increments the number of processors 
for the operator with the highest marginal benefit (lines 10-14); 
and (b) the diminishing marginal benefit property in 
Inequality (5), we derive the following inequality: 

$$E[T_y](k^*_y) - E[T_y](k^*_y + 1) \ge E[T_x](k^*_x - 1) - E[T_x](k^*_x)$$ 

In other words, in $\mathbf{k}^*$, taking one processor away from operator 
$x$ and assigning it to operator $y$ leads to a value of $E[T]$ that is 
no worse than before. This can be done repeatedly to gradually 
change $\mathbf{k}^*$ to $\mathbf{k}$, without increasing $E[T]$. Hence, 
$E[T](\mathbf{k}) \le E[T](\mathbf{k}^*)$. Since $\mathbf{k}^*$ is optimal, 
$\mathbf{k}$ must be optimal as well. ■ 
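
The greedy rule used in the proof can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of Algorithm 1: we assume each operator behaves as an M/M/k queue whose sojourn time follows the standard Erlang C formula, that the objective is the arrival-rate-weighted sum of operator sojourn times, and that the allocation starts from the smallest stable number of processors per operator. The function names and the example rates are hypothetical.

```python
from math import factorial

def mmk_sojourn(lam, mu, k):
    """Expected sojourn time of an M/M/k queue (Erlang C); inf if unstable."""
    a = lam / mu                       # offered load
    if k <= a:
        return float("inf")
    rho = a / k
    tail = a**k / (factorial(k) * (1 - rho))
    p_wait = tail / (sum(a**n / factorial(n) for n in range(k)) + tail)
    return p_wait / (k * mu - lam) + 1 / mu

def total_cost(lams, mus, ks):
    """Arrival-rate-weighted sum of sojourn times (assumed objective)."""
    return sum(l * mmk_sojourn(l, m, k) for l, m, k in zip(lams, mus, ks))

def assign_processors(lams, mus, k_max):
    """Give each spare processor to the operator with the largest marginal benefit."""
    # Assumption: start from the smallest stable allocation of every operator.
    ks = [int(l / m) + 1 for l, m in zip(lams, mus)]
    if sum(ks) > k_max:
        raise ValueError("not enough processors for a stable system")
    for _ in range(k_max - sum(ks)):
        gains = [lams[i] * (mmk_sojourn(lams[i], mus[i], ks[i])
                            - mmk_sojourn(lams[i], mus[i], ks[i] + 1))
                 for i in range(len(ks))]
        ks[max(range(len(ks)), key=gains.__getitem__)] += 1
    return ks

# Hypothetical arrival and service rates for three operators.
lams, mus = [10.0, 20.0, 5.0], [3.0, 4.0, 6.0]
alloc = assign_processors(lams, mus, k_max=14)
```

Because the marginal benefit of an extra processor shrinks as an operator receives more of them, the greedy result can be compared against an exhaustive search on small instances such as this one.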


Appendix B 

Technical details of the three DRS modules 
A. Measurer Module 

The measurer module is mainly responsible for the 
measurement on the CSP layer and the pre-processing of the 
metrics before sending them to the optimizer component. 
Recall from Algorithm 1 in Section III-C that for each operator 
$i$ of an application running on the CSP layer, it is essential to 
collect two local metrics of the operator: the average aggregate 
tuple arrival rate, denoted by $\lambda_i$, and the average service rate, 
denoted by $\mu_i$. In addition, the optimizer component also 
needs certain global metrics for its optimization algorithm, 
i.e., metrics related to individual tuples and multiple operators, 
which include the average arrival rate of external tuples, 
denoted by $\lambda_0$, and the average tuple complete sojourn time, 
$E[T]$, as described in Section III-C. 

There are two major technical challenges for the measurer 
module in the DRS layer. First, the operators and the instances 
within the operators may run on different physical machines 
during online stream processing. Therefore, the measurement 
must be conducted collaboratively in a distributed 
environment. Second, it is important for the measurer module to 
minimize the overhead of the measurement itself and maintain 
the high availability of the stream processing service. 


To tackle the challenges listed above, the measurer module 
in our system is designed as an independent system operator, 
mostly invisible to the system user and programmer. To collect 
the local metrics, a group of optional measurement logics are 
injected into the executables on each instance of the operators, 
such that the specified local metrics are collected and 
kept in the memory of the distributed nodes. A pull-based 
mechanism is employed to control the data flow from the 
operators of the topology to the measurement operator. To 
limit the overhead of distributed metric collection, a bi-layer 
sampling strategy is applied to the system. Specifically, each 
instance of the operators records the metric of one tuple every $N_m$ 
local input tuples, while the centralized measurement operator 
pulls updates from the other operators every $T_m$ seconds. 
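
The bi-layer sampling strategy can be sketched as follows. This is a single-process illustration only, not the Storm implementation: the class name `SampledTimer` and its methods are hypothetical, and the periodic pull every T_m seconds is represented by an explicit call.

```python
import time

class SampledTimer:
    """Times every n_m-th call of a tuple handler (first sampling layer)."""

    def __init__(self, handler, n_m):
        self.handler, self.n_m = handler, n_m
        self.seen = 0          # tuples received so far
        self.samples = []      # buffered execution times, in seconds

    def __call__(self, tup):
        self.seen += 1
        if self.seen % self.n_m:           # not a sampled tuple: no timing cost
            return self.handler(tup)
        start = time.perf_counter()
        result = self.handler(tup)
        self.samples.append(time.perf_counter() - start)
        return result

    def pull(self):
        """Second layer: the central collector drains the buffer periodically."""
        drained, self.samples = self.samples, []
        return drained

# Hypothetical handler (len) and sampling rate; 95 tuples yield 9 samples.
timer = SampledTimer(handler=len, n_m=10)
for i in range(95):
    timer("tuple-%d" % i)
batch = timer.pull()
```

Only one tuple in every n_m pays the timing cost, and the collector sees the buffered samples only when it pulls, which keeps both the per-tuple and the network overhead small.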

To collect the global metrics with respect to external tuples 
coming into the system, the measurement operator tracks the 
processing tree of each tuple, using existing techniques, e.g., the 
acknowledgment mechanism. The measurement 
operator thus receives notifications from the underlying infrastructure 
on the completion of the processing tree of each external tuple, and 
derives the global metrics from the notification times. 
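
A minimal sketch of how complete sojourn times could be derived from such completion notifications is given below. The class and method names are hypothetical, and timestamps are passed in explicitly so that the example is deterministic; a real deployment would rely on the acknowledgment callbacks of the underlying platform.

```python
class SojournTracker:
    """Derives complete sojourn times from emit and ack notifications."""

    def __init__(self):
        self.pending = {}      # tuple id -> time the external tuple entered
        self.completed = []    # complete sojourn times of acked tuples

    def on_emit(self, tuple_id, now):
        self.pending[tuple_id] = now

    def on_ack(self, tuple_id, now):
        start = self.pending.pop(tuple_id, None)
        if start is not None:              # ignore unknown or duplicate acks
            self.completed.append(now - start)

    def mean_sojourn(self):
        return sum(self.completed) / len(self.completed) if self.completed else None

tracker = SojournTracker()
tracker.on_emit("t1", now=0.0)
tracker.on_emit("t2", now=0.5)
tracker.on_ack("t1", now=2.0)    # processing tree of t1 completed after 2.0 s
tracker.on_ack("t2", now=1.5)    # t2 completed after 1.0 s
tracker.on_ack("t3", now=9.0)    # never emitted here, so it is ignored
```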

After the collection of the original metrics, the system still 
needs to go through pre-processing operations to eliminate the 
effects of noise, message loss and outliers. The operations 
include: (a) result aggregation at the operator level. This is 
crucial because the metrics we have defined and are interested 
in are at the operator level (e.g., the Jackson network), rather 
than the instance level, which may only contain some 
proportion; and (b) result smoothing, which helps to reduce the effects 
of noise and improve the stability of the system. Two 
smoothing options are supported in our system. 
Let $d(n)$ denote the measurement results of the $n$th 
interval collected and aggregated by the controller, and $D(n)$ 
denote the smoothed results after the $n$th interval. 
The first smoothing option is $\alpha$-weighted averaging, in which 
we have $D(n) = \alpha D(n-1) + (1-\alpha) d(n)$, with $\alpha \in [0, 1)$ as a 
tunable parameter controlling the fading rate of old metrics. 
The second smoothing option is window-based averaging, 
in which we have $D(n) = \frac{1}{w} \sum_{j=n-w+1}^{n} d(j)$, with $w$ as the 
window size parameter. 
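
Both smoothing options are simple to state in code. The sketch below follows the two formulas above; the function names are ours, and the handling of the first measurements (initialising D with the first value, and averaging over fewer than w values until the window fills) is an assumption that the text does not specify.

```python
from collections import deque

def alpha_smoother(alpha):
    """Return an updater computing D(n) = alpha * D(n-1) + (1 - alpha) * d(n)."""
    state = {"D": None}
    def update(d):
        # Assumption: the first measurement initialises D directly.
        state["D"] = d if state["D"] is None else alpha * state["D"] + (1 - alpha) * d
        return state["D"]
    return update

def window_smoother(w):
    """Return an updater computing the mean of the last w measurements."""
    window = deque(maxlen=w)       # old measurements fall out automatically
    def update(d):
        window.append(d)
        # Assumption: average over fewer values until w measurements arrive.
        return sum(window) / len(window)
    return update

ewma, win = alpha_smoother(0.5), window_smoother(3)
measurements = [10.0, 20.0, 40.0, 10.0]          # hypothetical d(1..4)
ewma_out = [ewma(d) for d in measurements]
win_out = [win(d) for d in measurements]
```

A larger alpha or w gives smoother but slower-reacting estimates, which matters for how quickly a scheduler reacts to a change in the arrival rate.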

B. Scheduler and Negotiator Modules 

Based on the optimization model specified by the user, 
the optimizer returns one of two types of optimization results: 
one minimizes the latency given the available resources, and the 
other minimizes the computation resources given the maximal latency. 
Since the optimization output only indicates the amount of 
resources assigned to each operator, the system can 
execute the results only when a concrete mapping between the 
available resources and the operators is constructed. The scheduler 
module and the resource negotiator module thus play an important 
role as translators in the system architecture. 

The output of Algorithm 1 is an optimal solution, given the 
maximum currently available resources $K_{\max}$. The scheduler 
first checks if the optimal solution is the same as the current 
allocation, which is read through the configuration reader as 
an input parameter. If not, the scheduler 
triggers the “resource allocation” module in the CSP layer 
to conduct a re-allocation. One technical difficulty is that the 
implementation of the scheduler must be coupled closely with 
the CSP layer, e.g., its API calls and the way its “resource 
allocation” works. 

In a practical CSP system, resource re-allocation always incurs 
costs, such as processing data migration, intermediate state 
save and load, etc. In the long run, the optimal solution of 
Program (4) may change, due to changes in the input data 
rate, or changes in the data properties that in turn affect the 
service time of each component. It is also possible that the 
current allocation becomes sub-optimal, but its performance 
in terms of the tuples’ average complete sojourn time is not 
far from the best that can be achieved. 

Consequently, the scheduler must make a further decision: 
given the optimal allocation and its 
expected performance (derived through our analytical queueing 
model), the currently working allocation and its measured 
average complete sojourn time, and the re-allocation cost (input 
as a parameter), is it beneficial enough to carry out the 
re-allocation? 
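
The text does not state the exact decision rule, so the following is only one plausible sketch: re-allocate when the allocations differ and the expected latency gain exceeds the re-allocation cost by some margin. The function name, the `min_gain` margin, the use of a common latency unit (ms) for the cost, and the latency values in the example are all hypothetical.

```python
def should_reallocate(current, optimal, measured_t, estimated_t, cost, min_gain=0.0):
    """Hypothetical rule: migrate only if the expected latency gain
    (measured minus estimated sojourn time) outweighs the re-allocation cost."""
    if tuple(current) == tuple(optimal):
        return False                       # already running the optimal plan
    return (measured_t - estimated_t) - cost > min_gain

# Allocations taken from the experiments; latency and cost values are made up.
big_gain = should_reallocate((8, 12, 2), (10, 11, 1), measured_t=900.0,
                             estimated_t=500.0, cost=100.0)
small_gain = should_reallocate((8, 12, 2), (10, 11, 1), measured_t=520.0,
                               estimated_t=500.0, cost=100.0)
same_plan = should_reallocate((10, 11, 1), (10, 11, 1), measured_t=900.0,
                              estimated_t=500.0, cost=100.0)
```

Such a rule avoids migrating when the current allocation is sub-optimal but already close to the best achievable performance.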

On the other hand, finding the minimum required amount 
of resources, i.e., Program (6), is meaningless if we interpret 
the output only at the logical resource unit (pool) level, but 
meaningful at the physical resource layer. The motivation 
is practical: with 
fewer physical resources in use (which are controlled by the 
resource manager of the CSP layer), we save costs in terms of 
(a) expenditure for renting virtual machines from cloud 
services, especially when the budget is tight; and (b) power 
consumption, e.g., for local machines. 

Therefore, the resource negotiator works at an even lower 
layer than the “resource manager” of the CSP layer. It 
negotiates with the physical machines or the cloud service provider 
by implementing several dedicated APIs, the most 
important of which launch and stop the “resource manager” 
daemon process of the CSP layer. 

C. Configuration Reader Module 

The configuration reader is designed to be a general 
interface for managing a data structure containing the configuration 
parameters provided by either the users or the CSP layer. We 
list part of the parameters: (a) the type of the optimization 
problem, i.e., Program (4) or Program (6); (b) the corresponding 
$K_{\max}$ and $T_{\max}$ for the algorithms in the optimizer; (c) for the 
measurer, e.g., the sampling rate $N_m$, the trigger interval $T_m$, and $\alpha$ or $w$ 
for the smoothing; (d) for the scheduler, e.g., the current 
running allocation of the CSP layer, and the re-allocation cost. 
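
As an illustration, the parameters listed above could be grouped into a single structure such as the following. The class name, field names, default values and validation rules are hypothetical and are not taken from the DRS source code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class DrsConfig:
    problem: str                         # "min_latency" (Program 4) or "min_resources" (Program 6)
    k_max: Optional[int] = None          # processor budget for the optimizer
    t_max_ms: Optional[float] = None     # latency target for the optimizer
    n_m: int = 100                       # measurer: sample one tuple in every n_m (made-up default)
    t_m_s: float = 5.0                   # measurer: pull interval in seconds (made-up default)
    alpha: Optional[float] = None        # smoothing option 1: alpha-weighted averaging
    window: Optional[int] = None         # smoothing option 2: window-based averaging
    current_allocation: Tuple[int, ...] = ()
    reallocation_cost_ms: float = 0.0

    def validate(self):
        if self.problem == "min_latency" and self.k_max is None:
            raise ValueError("Program (4) needs k_max")
        if self.problem == "min_resources" and self.t_max_ms is None:
            raise ValueError("Program (6) needs t_max_ms")
        if (self.alpha is None) == (self.window is None):
            raise ValueError("choose exactly one smoothing option")
        return self

cfg = DrsConfig(problem="min_resources", t_max_ms=500.0, alpha=0.5,
                current_allocation=(8, 8, 1)).validate()
```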

Appendix C 

Overview of Storm and our implementation of the 
DRS modules 

An application running on Storm is defined by a topology, 
with vertices as user-defined operators (containing computation 
logics) and edges as indicators of data flows between operators. 
There are two types of operators in Storm, spouts and bolts. 
A spout acts as a data source, which connects to external 
streaming sources. Bolts include all other (i.e., non-source) 
operators. Each operator contains one or more processors, 
called executors, running on different servers in the cloud. 


Storm supports dynamically “re-scaling” an operator (spout 
or a bolt), which changes its number of executors. This 
is implemented by decoupling the routing logics from the 
computation logics. The routing logics remain the same even 
when new executors are added. Storm’s implementation is 
based on a partitioning scheme on each operator (spout or 
bolt), in which each partition is called a task. When an operator 
scales out (respectively in), the number of executors of the 
operators increases (decreases), with the tasks reassigned to 
the executors. In particular, there are different partitioning rules 
supported by Storm, e.g., shuffle, field and direct grouping. We 
refer the reader to 1^ for details of the partitioning rules. 
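
The decoupling of routing from computation can be sketched as follows: tuples are routed to a fixed set of tasks, and re-scaling only changes which executor hosts which tasks. The contiguous, near-equal split and both function names are assumptions made for illustration, not a description of Storm's actual assignment code.

```python
def assign_tasks(num_tasks, num_executors):
    """Split task ids 0..num_tasks-1 into contiguous, near-equal blocks,
    one block per executor (assumed assignment policy)."""
    if not 1 <= num_executors <= num_tasks:
        raise ValueError("need 1 <= executors <= tasks")
    base, extra = divmod(num_tasks, num_executors)
    blocks, start = [], 0
    for e in range(num_executors):
        size = base + (1 if e < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks

def route(key, num_tasks):
    """Routing depends only on the fixed number of tasks, not on executors."""
    return hash(key) % num_tasks

before = assign_tasks(8, 2)    # operator with 8 tasks on 2 executors
after = assign_tasks(8, 3)     # scale out to 3 executors: tasks are reassigned
target = route(13, 8)          # the target task of key 13 is unchanged by scaling
```

Scaling from two to three executors moves tasks between executors, but a tuple with a given key is still sent to the same task before and after.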

Given the architecture of the Storm system, resource 
allocation/re-allocation can be controlled by assigning different 
numbers of executors to operators. Storm also provides an 
internal mechanism for migrating to a new resource 
configuration, called re-balancing. Simply put, the re-balancing 
mechanism suspends the entire system (e.g., by shutting 
down all the Java Virtual Machines), modifies the executor 
to operator mappings and routing, and finally resumes the 
system. Hence, the response time becomes very high during 
re-balancing. Therefore, in the real implementation of DRS 
and the experiments, we developed and used our own version 
(which involves coding at the Storm core layer in Clojure) 
of the re-balancing mechanism, with significant improvements 
over Storm’s default version. The most essential 
improvement we have made is to re-use the JVMs. A discussion 
of how to migrate to 
a new resource configuration without such costly system-wide 
suspensions is out of the scope of this paper. Finally, 
Storm provides a scheduler interface that enables customized 
executor assignment strategies, and allows users to specify the 
operation frequency of the scheduler. 

Measurer: We implemented two new system operators (not 
visible to users) in the Storm system, called 
MeasurableSpout and MeasurableBolt. They wrap a normal spout/bolt and 
add measurement logics. The measurement for bolts mainly 
records the time that the execution function spends 
on each of the incoming tuples. These measured results are 
collected periodically by the “DRSMetricCollecor” module, 
which is implemented using the Measurement APIs provided 
by Storm. Measuring the queue-related metrics, e.g., the 
average tuple arrival rate to each operator $i$, is more 
complicated, because there is no available API we can make use 
of. Therefore, we had to modify the source code of the Storm 
core to add the measurement logics. Note that the arrival rate 
should be measured at the tail of the operator queue, rather than at 
the queue head. 

Scheduler: Since Storm provides the scheduling APIs, our 
testing platform simply calls these APIs, which reassign the 
executors, call our version of the re-balancing function, and 
continue processing the incoming data stream automatically. 

Configuration Reader: Similarly, the configuration reader 
reuses the APIs of the Storm system, which shares the 
configuration in Zookeeper. 

Negotiator: The negotiator is at a lower level than the resource 
manager of Storm. It is in charge of starting/shutting down 
extra/existing physical resources (e.g., physical machines or 
virtual machines). Our negotiator module is based on the APIs 
of YARN, on top of a Hadoop cluster. 


